
feat(orchestrator): per-env sample strategy + env-mix seam #2722

Draft
hallerite wants to merge 1 commit into feat/per-env-advantage from feat/per-env-sampler

@hallerite
Member

Stacked on #2721 (feat/per-env-advantage). Base this PR on that branch; review and merge #2721 first.

What

Introduce the per-env sampling seam: each training env owns a SampleStrategy (what example to serve, plus an observe() feedback hook), and env selection is delegated to a swappable EnvMixStrategy. Defaults reproduce today's behavior; this is the foundation for curriculum / replay samplers.

Why

TrainSource previously hard-owned dataset iteration and env selection in one class, with no way to (a) plug a different per-env example-selection policy or (b) feed rollout outcomes back to the sampler. Splitting these into per-env + global strategies — and routing scored groups back via observe() — is what makes curriculum learning and (later) replay expressible without touching the dispatcher/perf path.
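To make the feedback idea concrete, here is a minimal sketch of the kind of sampler the `observe()` hook is meant to enable. It is not part of this PR: the class name `RewardBandSampler`, the `sample()` method name, the row key `id`, the rollout keys `example_id` and `reward`, and the assumption that rewards lie in [0, 1] are all hypothetical.

```python
import random
from typing import Any, Dict, List, Sequence


class RewardBandSampler:
    """Hypothetical curriculum sampler: favors examples with mid-range mean reward."""

    def __init__(self, rows: Sequence[Dict[str, Any]], seed: int) -> None:
        self._rows: List[Dict[str, Any]] = list(rows)
        if not self._rows:
            raise ValueError("rows must be non-empty")
        self._rng = random.Random(seed)
        self._reward_sum: Dict[Any, float] = {}
        self._reward_count: Dict[Any, int] = {}

    def priority(self, example_id: Any) -> float:
        """Sampling weight for one example, based on rewards observed so far."""
        count = self._reward_count.get(example_id, 0)
        if count == 0:
            return 1.0  # unseen examples get full priority
        mean = self._reward_sum[example_id] / count
        # Peaks at mean == 0.5; the floor keeps every example reachable.
        return max(0.05, 1.0 - abs(mean - 0.5) * 2.0)

    def sample(self) -> Dict[str, Any]:
        """Draw one row with probability proportional to its priority."""
        weights = [self.priority(row["id"]) for row in self._rows]
        return self._rng.choices(self._rows, weights=weights, k=1)[0]

    def observe(self, rollouts: Sequence[Dict[str, Any]]) -> None:
        """Accumulate per-example rewards from scored rollouts."""
        for rollout in rollouts:
            key = rollout["example_id"]
            self._reward_sum[key] = self._reward_sum.get(key, 0.0) + rollout["reward"]
            self._reward_count[key] = self._reward_count.get(key, 0) + 1
```

With a sampler like this, examples the policy always solves (or always fails) drift toward the floor weight, while examples with mixed outcomes stay likely to be served. The exact priority rule is an illustrative choice, not something the PR prescribes.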

Changes

  • orchestrator/sampling.py (new): SampleStrategy ABC + ShuffledCursorSampler default (per-env: shuffle rows once, walk a reshuffling cursor); EnvMixStrategy ABC + WeightedRoundRobin default (which env next).
  • TrainEnv now owns its dataset via build_sampler() and holds a .sampler — reachable by both the source (pull) and the sink (observe).
  • TrainSource shrinks to: build per-env samplers + EnvMixStrategy; next_example picks an env then pulls from that env's sampler. (Folds in the env-mix extraction — the "slice b" seam.)
  • TrainSink.process_group calls env.sampler.observe(survivors) after advantages are assigned — the feedback wire (no-op for the default sampler).
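The shape of the two seams can be sketched roughly as follows. This is an illustration under assumptions, not the contents of `orchestrator/sampling.py`: the class names and `observe()` come from the description above, but the method names `sample()` and `next_env()`, the constructor signatures, and the use of a seeded weighted random draw for `WeightedRoundRobin` are guesses.

```python
import random
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List, Sequence


class SampleStrategy(ABC):
    """Per-env policy: which example to serve next, plus a feedback hook."""

    @abstractmethod
    def sample(self) -> Any:  # method name is an assumption
        ...

    def observe(self, rollouts: Sequence[Any]) -> None:
        """Receive scored rollouts; a no-op unless a subclass overrides it."""
        return None


class ShuffledCursorSampler(SampleStrategy):
    """Shuffle rows once, walk them in order, reshuffle when exhausted."""

    def __init__(self, rows: Iterable[Any], seed: int) -> None:
        self._rows: List[Any] = list(rows)
        if not self._rows:
            raise ValueError("rows must be non-empty")
        self._rng = random.Random(seed)
        self._rng.shuffle(self._rows)
        self._cursor = 0

    def sample(self) -> Any:
        if self._cursor >= len(self._rows):
            # One full pass is done: reshuffle and start a new pass.
            self._rng.shuffle(self._rows)
            self._cursor = 0
        row = self._rows[self._cursor]
        self._cursor += 1
        return row


class EnvMixStrategy(ABC):
    """Global policy: which env to pull from next."""

    @abstractmethod
    def next_env(self) -> str:  # method name is an assumption
        ...


class WeightedRoundRobin(EnvMixStrategy):
    """Pick envs in proportion to their weights (assumed: seeded random draw)."""

    def __init__(self, weights: Dict[str, float], seed: int) -> None:
        self._names = list(weights)
        self._weights = [weights[name] for name in self._names]
        self._rng = random.Random(seed)

    def next_env(self) -> str:
        return self._rng.choices(self._names, weights=self._weights, k=1)[0]
```

Under these assumptions, the source side reduces to `name = mix.next_env()` followed by `samplers[name].sample()`, and the sink side to `samplers[name].observe(survivors)` once advantages are assigned.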

Behavior

Behavior-equivalent to before in distribution: a weighted round-robin over per-env datasets, each shuffled once and then walked with a reshuffling cursor. The default observe() is a no-op, so the feedback wire does not change default training. RNG is now partitioned into one stream per env plus one for the env mix, rather than one shared generator, so the exact example sequence for a given seed differs from before. The sampling distribution is unchanged, and nothing depends on the old ordering.
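One way to partition RNG per env plus mix is to derive each stream's seed from a base seed and a stream label. The PR text does not say how the seeds are actually derived, so the helpers `derive_rng` and `build_rngs` and the `env:<name>` / `mix` labels below are assumptions for illustration only.

```python
import hashlib
import random
from typing import Dict, Iterable


def derive_rng(base_seed: int, stream: str) -> random.Random:
    """Return a generator whose seed depends only on (base_seed, stream)."""
    digest = hashlib.sha256(f"{base_seed}:{stream}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def build_rngs(base_seed: int, env_names: Iterable[str]) -> Dict[str, random.Random]:
    """One independent stream per env, plus one for env selection."""
    rngs = {f"env:{name}": derive_rng(base_seed, f"env:{name}") for name in env_names}
    rngs["mix"] = derive_rng(base_seed, "mix")
    return rngs
```

A consequence of this scheme is that adding or removing an env leaves every other env's example order untouched for the same base seed, which a single shared generator cannot guarantee.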

Testing

  • tests/unit/orchestrator/test_sampling.py (new, 8 tests): cursor cycles-without-repeats-then-reshuffles, determinism per seed, empty-rows guard, observe no-op, weighted-RR distribution + determinism.
  • ruff check + format --check clean; existing test_advantage.py (17) + test_configs.py (106) still pass.
  • Validated end-to-end on 2× RTX PRO 6000 (Blackwell): a 3-step multi-env reverse_text RL run with two envs (rt-grpo, rt-lenpen) — both sampled every step through the new EnvMixStrategy + per-env samplers (varying ratios), trained cleanly (Error 0.0%, exit 0), with the observe() wire firing per group.

🤖 Generated with Claude Code

Introduce the per-env sampling seam. Each train env owns a `SampleStrategy`
(what example to serve, plus an `observe()` feedback hook); env selection is
delegated to a swappable `EnvMixStrategy`. Defaults reproduce today's behavior
(weighted round-robin over per-env reshuffling-cursor datasets).

- `orchestrator/sampling.py` (new): SampleStrategy + ShuffledCursorSampler;
  EnvMixStrategy + WeightedRoundRobin.
- TrainEnv owns its dataset via `build_sampler()` and holds `.sampler`.
- TrainSource slims to env-mix + per-env samplers.
- TrainSink.process_group calls `env.sampler.observe(survivors)` after advantages
  (no-op default) — the feedback wire for curriculum / replay samplers.

Behavior-equivalent; RNG partitioned per-env + mix. Stacked on feat/per-env-advantage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>